Abstract

Lexical semantic resources, like WordNet, are often used in real applications of natural language document processing. For example,
we integrated GermaNet into our document suite XDOC. In addition to hypernymy and synonymy relations, we want to exploit GermaNet
verb frames for our analysis. In this paper, we outline an approach for the domain-related enrichment of GermaNet verb frames by
corpus-based syntactic and co-occurrence data analyses of real documents.



1. Introduction 

Lexical resources, like WordNet (Fellbaum, 1998),
GermaNet (Hamp and Feldweg, 1997; Kunze, 2001), or
EuroWordNet, are increasingly applied as resources
in applications like Text Mining and Data Mining.
Often it is necessary to extend or adapt the
resources for the application (see also Vossen, 2001;
Navigli and Velardi, 2002).

In (Kunze and Rösner, 2003) we presented the integration
of the lexical resource GermaNet into our document
processing system XDOC. In XDOC, GermaNet is used as
a linguistic (lexical) resource for tasks like

• semantic tagging of tokens, 

• case frame analysis, and 

• semantic interpretation of syntactic structures (SIsS). 

One problem for the integration of GermaNet resources was 
the usage of GermaNet's verb frames. The problem there 
was that the information encoded in GermaNet verb frames 
is not sufficient (i.e., not detailed enough) for usage within 
the case frame analysis of XDOC. 

Case frame analysis needs detailed information about 
the syntactic form (e.g., which preposition and which case 
of the PP), the semantic category of the filler of a relation, 
and which thematic role is described by the relation (e.g., 
agent, location, etc.) for a complement of a verb. 

GermaNet's verb frames have two deficits with respect 
to usage in XDOC: 

1. for verbs, the information given is incomplete (e.g., 
preposition, semantic category, and thematic role are 
missing) and 

2. for nouns, no frame information is available. 

The missing information can be classified into several
types: lexical (preposition), syntactic (case of the noun
phrase in a prepositional phrase¹), and semantic (category
of the filler and the thematic role of the relation
described). The creation or manual adaptation of GermaNet's
resources is time-consuming. Related work includes the
automatic building of subcategorisation lexicons for German
verbs (Wauschkuhn, 1999; Schulte im Walde, 2002) and the
automatic identification of thematic roles (see Gildea and
Hockenmaier, 2003; Gildea and Jurafsky, 2002) by exploiting
a syntactically parsed corpus as input data. The method
described in this paper extracts the necessary syntactic
information and information about the required semantic
categories for possible complements of a verb. This approach
uses a corpus annotated with chunks (noun phrases and
prepositional phrases) and GermaNet's verb frame information.

¹For example, the preposition 'in' can require an NP in the
accusative or the dative case.

This paper is organized as follows: first, we give a short
description of the evaluation environment. After this,
we present the methodology by discussing an example.
This is followed by the presentation and discussion of
results from our experiments.

2. Evaluation Environment 
2.1. Evaluation Corpus 

For our work we used a corpus of medical documents 
in German (forensic autopsy protocols) with more than 1 
million running word forms. The autopsy protocols have 
a strictly defined content and layout. They are separated 
into different document parts, e.g. findings, background, 
discussion, death causes, etc. Each document part has its 
own characteristics (sub-language). 

The analyses with XDOC concentrate on the findings,
background and discussion sections. The findings section
contains a high ratio of nouns and adjectives, and its
syntactic (sentence) structures are mostly short. This
section describes the medical findings in an everyday
vocabulary without domain specific (medical) terms. A
standard distribution of all word classes and regular
syntactic structures occurs in the background and discussion
sections. The background section describes, for example, the
details of a traffic accident, while the discussion section
combines the results of the findings section with the
facts reported in the background section. Both document
parts contain a large and varied number of named
entities (NE). For example, each forensic autopsy protocol
has a registration number (e.g., G 123/45), which is often
referred to in the document. Furthermore, other NEs like
date specifications, names of locations (e.g., streets, cities),
or names of persons occur in the texts.

The analysis described in this paper concentrates only
on the document parts background and discussion. These
parts were chosen because both contain regular syntactic
structures of German and only a small proportion of
domain specific terms. Their tokens are much more likely
to belong to everyday language than to a sub-language of
a specific domain.

2.2. Tools and Resources 

For the adaptation of GermaNet verb frames, a syntactically
annotated corpus is required. For this and other
preprocessing steps, the document suite XDOC² is used.
In particular, the following preprocessing steps of XDOC
are used:

• sentence splitter, 

• POS tagger, 

• syntactic parser. 

These methods output their results as XML structures,
which the subsequent processing steps accept as their
input. For the extraction of relevant information
(syntactic structures), XSL transformation is applied
(Clark, 1999).

The quality of the expected results depends strongly
on the quality of the input data, especially the results of
the chart parser. The syntactic parser of XDOC is a
bottom-up chart parser, which works with a context free
grammar for German (ca. 400 rules). The robust XDOC
syntactic parser outputs either completely parsed sentences
(readings) or only partially parsed structures (coverings).
In these coverings, basic structures like noun phrases or
prepositional phrases (frequent elements in GermaNet verb
frames) are annotated by XDOC's parser (see Fig. 1).

3. An Outline of the Approach 

The basic assumption for the approach is that in a cor- 
pus with similar texts (news, expert's report, abstracts, etc.) 
a frequent verb co-occurs with the same complements. The 
complements of such a verb often appear in a similar syn- 
tactic structure at the same position in a sentence. Further, 
the fillers have the same semantic category. In the case of 
the autopsy corpus the number of authors of the documents 
is small. This results in a high rate of repetition of specific 
wordings or phrases, because authors have the tendency to 
use the same phrase for the description of similar facts (au- 
thor style). 

The steps of the procedure are:

• use the verb frames given by GermaNet as simple patterns
for the recognition of potential candidates for
case frames in the corpus,

• extract information about prepositions used in a
complement (element of the case frame candidate), and

• count the occurrences of similar semantic fillers
(roles) for an element of the case frame.

²For a full description of the methods inside XDOC see
(Rösner and Kunze, 2002).

The approach is presented by considering the following
verbs as examples: versterben (to pass away), kollidieren
(to collide), befahren (to drive on), and operieren (to operate).



verb          occurrences   frame information from GermaNet
kollidieren    34           NN.Pp
operieren      14           NN.AN or NN.AN.BL
versterben    128           NN.BT
erfolgen      187           NE.AN or NN.PP
befahren       59           NN.AN or NN.AN.AZ or NN.AN.BM
ereignen       29           NE or NE.AR or NN.AR.BT or NN.BL

Table 1: Verbs and their GermaNet verb frame information.



All these verbs have only one sense in GermaNet, which
has the correct meaning for our cases. However, multiple
verb frames can be assigned to a sense; see, for example,
the verb befahren with three verb frames (see also Table 1).³

GermaNet verb frames characterise the syntactic
subcategorisation of a verb. The elements of a verb frame
describe the different complements of the verb. For example,
'NN' and 'AN' stand for a noun phrase in the nominative and
accusative case, respectively; 'PP' represents a prepositional
phrase, but without information about the concrete
preposition used. These elements give no information about
the semantic category of the role filler or about the
thematic role. Only elements like 'BM' or 'BL' (an adverbial
complement or a prepositional phrase that indicates a manner
or local complement, respectively) give some semantic
restriction for the filler.
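
The frame notation can be decomposed mechanically. The following Python sketch splits an entry such as 'NN.AN or NN.AN.BL' into frames and elements and attaches the descriptions given in this paper. The names ELEMENTS, parse_frame and parse_frames are illustrative assumptions, not part of GermaNet or XDOC, and the dictionary covers only the codes explained in this text; other codes (e.g., 'NE' or 'AR') are reported as unspecified rather than guessed.

```python
# Element codes as described in this paper; NOT a complete GermaNet inventory.
ELEMENTS = {
    "NN": "noun phrase, nominative",
    "AN": "noun phrase, accusative",
    "PP": "prepositional phrase, required",
    "Pp": "prepositional phrase, optional",
    "BM": "adverbial or PP complement, manner",
    "BL": "adverbial or PP complement, local",
    "BT": "adverbial or PP complement, temporal",
    "AZ": "zu-infinitive complement",
}


def parse_frame(frame):
    """Split one frame such as 'NN.AN.BL' into (code, description) pairs."""
    return [
        (code, ELEMENTS.get(code, "not specified in this text"))
        for code in frame.split(".")
    ]


def parse_frames(entry):
    """Split a table entry such as 'NN.AN or NN.AN.BL' into parsed frames."""
    return [parse_frame(part.strip()) for part in entry.split(" or ")]


if __name__ == "__main__":
    for frame in parse_frames("NN.AN or NN.AN.AZ or NN.AN.BM"):
        print(frame)
```

Keeping unknown codes visible, instead of dropping them, makes it obvious which frame elements the enrichment procedure cannot yet interpret.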

In the following, we sketch the approach. The starting
point is a corpus of documents, each of which is separated
into a sequence of sentences.

To complete the case frame analysis of a frequent verb 
in the corpus, all sentences in which the verb occurs were 
selected. These sentences are annotated by the POS Tagger 
of XDOC and are parsed by the syntactic parser of XDOC. 
For the analysis, only the annotation of NPs and PPs is re- 
quired. In this case, the grammar of the chart parser was 
reduced to 15 rules for the annotation of basic structures, 
like noun phrases and prepositional phrases. 

According to the verb frame information of GermaNet,
possible candidates are selected from the parsed structures
by means of XSL transformations. For example, for elements
like NN and AN, noun phrases in the nominative or
accusative case, respectively, are selected. Elements like PP
are prepositional phrases with unspecified case and
preposition. Other elements are ambiguous, like the element BM.
The syntactic realisation of BM could be a prepositional
phrase or an adverb. In this case, both realisations must be
considered during search and analysis.
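
The selection step can be illustrated with a small sketch. The paper performs it with XSL transformations; the Python code below is only a stand-in that uses the standard xml.etree.ElementTree module on a hand-made covering in the style of the XDOC output. The function names, the sample covering, and the attribute values CAS="NOM" and CAS="ACC" are assumptions made for illustration; only CAS="DAT" is attested in the parser output shown in this paper.

```python
import xml.etree.ElementTree as ET

# Hand-made covering for illustration; not actual XDOC output.
SAMPLE = """<COVERING NR="1">
  <NP TYPE="FULL" CAS="NOM" NUM="SG"><DETD>der</DETD><N>Pkw</N></NP>
  <V ROOT="kollidier" FLEX="FIN">kollidierte</V>
  <PP RULE="PP1" CAS="DAT">
    <PRP CAS="DAT">mit</PRP>
    <NP TYPE="FULL" CAS="DAT" NUM="SG"><DETD>dem</DETD><N>Radfahrer</N></NP>
  </PP>
  <IP>.</IP>
</COVERING>"""


def text_of(elem):
    """Join all word forms below an element into one string."""
    return " ".join(t.strip() for t in elem.itertext() if t.strip())


def select_candidates(covering_xml, frame_codes):
    """Collect candidate fillers for each frame element from one covering."""
    root = ET.fromstring(covering_xml)
    candidates = {code: [] for code in frame_codes}
    # Only top-level chunks are inspected: an NP embedded in a PP is part
    # of that PP and not a complement of its own.
    for child in root:
        if child.tag == "NP":
            case = child.get("CAS")  # "NOM"/"ACC" are assumed values
            if case == "NOM" and "NN" in candidates:
                candidates["NN"].append(text_of(child))
            elif case == "ACC" and "AN" in candidates:
                candidates["AN"].append(text_of(child))
        elif child.tag == "PP":
            for code in ("PP", "Pp"):
                if code in candidates:
                    prep = child.findtext("PRP")
                    candidates[code].append((prep, child.get("CAS"), text_of(child)))
    return candidates


if __name__ == "__main__":
    print(select_candidates(SAMPLE, ["NN", "Pp"]))
```

Ambiguous elements such as BM would need an additional branch that accepts both PP and adverb chunks; this is omitted here.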

The GermaNet verb frame of the verb collide (kollidieren)
contains the following information: 'NN.Pp', i.e., a noun
phrase in the nominative case and an optional⁴ prepositional
phrase. The following sentences occur in the corpus:

³A detailed description of the notation of the verb frames is
available at: http://www.sfs.nphil.uni-tuebingen.de/lsd/

⁴The GermaNet notation 'PP' means a required prepositional
phrase.



<COVERING NR="1">
  <XXX>Beide</XXX>
  <V ROOT="befind" FLEX="FIN">befanden</V>
  <REFPRO>sich</REFPRO>
  <PP RULE="PP1" CAS="DAT">
    <PRP CAS="DAT">am</PRP>
    <NP TYPE="FULL" RULE="NP1" CAS="DAT" NUM="PL" GEN="_">
      <ADJ>rechten</ADJ>
      <N SRC="UCl">Fahrbahnrand</N>
    </NP>
  </PP>
  <IP>,</IP>
  <S-KONJ>als</S-KONJ>
  <DETI>ein</DETI>
  <NE>Mazda</NE>
  <NR>323</NR>
  <ADJP RULE="ADJP1">
    <XXX AS="ADV">beide</XXX>
    <XXX AS="ADJ">ueberholte</XXX>
  </ADJP>
  <K-KONJ>und</K-KONJ>
  <ADV>dabei</ADV>
  <PP RULE="PP1" CAS="DAT">
    <PRP CAS="DAT">mit</PRP>
    <NP TYPE="FULL" RULE="NP2" CAS="DAT" NUM="SG" GEN="NTR">
      <DETD>dem</DETD>
      <N SRC="UCl">Radfahrer</N>
    </NP>
  </PP>
  <V ROOT="kollidier" FLEX="FIN">kollidierte</V>
  <IP>.</IP>
</COVERING>



Figure 1: A syntactically parsed sentence with NP and PP
chunks.



• Der erste Hänger kollidierte vermutlich mit der
vorderen rechten Seite mit einem ... Haus.

• ... sein LKW kollidierte mit dem PKW.

• Der Pkw ... kollidierte mit 3 Begrenzungsstäben.

• Der ... Pkw Peugeot hingegen kollidierte frontal mit
dem Pkw Renault.

• Nachfolgend kollidierten 3 Pkw mit dem VW Golf.



The first assignment ('NN') is easy to handle: all noun
phrases in the nominative case are selected. Given the
sentences above, the following instances can be assigned to
the 'NN' element of the verb frame:

• NN: der erste Hänger, sein LKW, der Pkw, der Pkw
Peugeot, 3 Pkw



Semantic classification uses information available in 
GermaNet. In the first step, GermaNet top level informa- 
tion is used as a shallow classification. To improve the clas- 
sification, the hypernymy tree information of GermaNet is 
exploited. 

For the verb collide, the following two types of fillers
for the 'NN' element are encountered in the corpus. The first
type is a person, referenced by pronouns (11), by registration
numbers (6), like G 1234/11, or by a person name (1). The
second type of filler is a vehicle (16). Both types
describe traffic participants (road users).
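
The classification along the hypernymy tree can be sketched as follows. The code walks from the head noun of a filler upwards through its hypernyms until it reaches one of the target categories, and then counts the categories. The hypernym table below is a hand-made toy example and an assumption for illustration only; it is not taken from GermaNet, and the function names are hypothetical.

```python
from collections import Counter

# Toy hypernym links (child -> parent), invented for illustration only.
TOY_HYPERNYMS = {
    "Pkw": "Kraftfahrzeug",
    "LKW": "Kraftfahrzeug",
    "Kraftfahrzeug": "Fahrzeug",
    "Hänger": "Fahrzeug",
    "Fahrer": "Person",
}


def classify(noun, categories, hypernyms=TOY_HYPERNYMS):
    """Return the first category found on the hypernym path of a noun."""
    seen = set()
    node = noun
    # The 'seen' set guards against cycles in a faulty hypernym table.
    while node is not None and node not in seen:
        if node in categories:
            return node
        seen.add(node)
        node = hypernyms.get(node)
    return None


def count_categories(head_nouns, categories):
    """Count how often each category occurs among the filler head nouns."""
    return Counter(classify(n, categories) or "unknown" for n in head_nouns)


if __name__ == "__main__":
    fillers = ["Hänger", "LKW", "Pkw", "Pkw", "Pkw"]
    print(count_categories(fillers, {"Fahrzeug", "Person"}))
```

Fillers that are named entities or pronouns have no entry in a lexical resource and would end up as "unknown" here; they need a separate treatment, as noted for the registration numbers and person names above.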

In the next step, the occurrences of prepositions in these
sentences are counted. The verb collide occurs in 34
sentences within the corpus. The prepositions frequently used
in these sentences are presented in Table 2.



preposition      ratio   semantic
mit (dat)         54     temporal, instrument, modal, causal
auf (dat, acc)    23     local, temporal, modal, causal
aus (dat)         17     local, causal
am (dat)          17     modal, temporal
nach (dat)        15     local, temporal, final, modal
in (dat, acc)     13     local, temporal, modal
von (dat)         11     local, temporal, modal

Table 2: High-frequency prepositions co-occurring with the
verb collide.



Table 2 shows that different prepositions are possible as
the indicator for the prepositional phrase of the verb frame.
For the selection of the correct preposition, the following
assumptions are used. The PPs co-occurring with instances of
the verb kollidieren (collide) are evaluated. Most
prepositions allow for different semantic interpretations
(cf. Table 2). Disambiguation is only possible when the
classification of the embedded NP is taken into account. Only
PPs that do not refer to local or temporal circumstances are
counted, because temporal or local adjuncts can co-occur with
most verbs.⁵ From the remaining PPs, the preposition with the
highest frequency is taken as the candidate for the derived
case frame.
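
The selection rule can be written down compactly. The sketch below assumes that each PP has already been reduced to a pair of preposition and semantic reading (derived from the classification of the embedded NP); PPs with a local or temporal reading are discarded and the most frequent remaining preposition is returned. The function name, the reading labels, and the observations in the usage example are invented for illustration and are not corpus data.

```python
from collections import Counter

# Readings treated as adjuncts and therefore ignored when counting.
EXCLUDED_READINGS = {"local", "temporal"}


def select_preposition(pp_observations, excluded=EXCLUDED_READINGS):
    """Return (best_preposition, counts) for (preposition, reading) pairs.

    best_preposition is None if no PP survives the filter.
    """
    counts = Counter(
        prep for prep, reading in pp_observations if reading not in excluded
    )
    if not counts:
        return None, counts
    best, _ = counts.most_common(1)[0]
    return best, counts


if __name__ == "__main__":
    # Invented observations, not taken from the autopsy corpus.
    observations = (
        [("mit", "object")] * 3
        + [("auf", "local")] * 4
        + [("nach", "temporal")] * 2
        + [("nach", "modal")]
    )
    print(select_preposition(observations))
```

Passing an empty set as `excluded` would keep local and temporal PPs, which is one way to handle frames that explicitly contain elements such as BT or BL.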

Furthermore, this approach can be enhanced by exploiting
information about the distance between the verb and a
potential prepositional phrase complement. Within a sentence,
the same preposition can occur in more than one PP.
Coordination is one example:

• Nach Angaben der anwesenden Kliniker soll er mit
einem PKW von der Fahrbahn abgekommen sein und
dort mit feststehenden Gegenständen kollidiert sein.

For examples like this, the approach described can be
enhanced in the following way: only clause structures
instead of whole sentences are explored, because more than
one verb may occur in a sentence. The clauses in these
sentences are split at commas or at the conjunction 'und'
(and). In addition, a heuristic is used: only PPs situated
next to the verb, directly before or after it (the scope of
a verb), are analysed. The refinement of this simple
heuristic is left for further work.
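
A minimal sketch of the clause splitting and the scope heuristic is given below. It assumes that a sentence is available as a flat list of (tag, text) chunks, similar to the top level of a covering; the chunk list in the usage example was written by hand for the coordination sentence above and is not parser output. The function names and the tag 'PRO' are hypothetical.

```python
def split_clauses(chunks):
    """Split a chunk sequence at commas and at the conjunction 'und'."""
    clauses, current = [], []
    for tag, text in chunks:
        is_comma = tag == "IP" and text == ","
        is_und = tag == "K-KONJ" and text == "und"
        if is_comma or is_und:
            if current:
                clauses.append(current)
            current = []
        else:
            current.append((tag, text))
    if current:
        clauses.append(current)
    return clauses


def pps_in_scope(chunks, verb_form):
    """Return the PPs directly before or after the given verb form."""
    result = []
    for clause in split_clauses(chunks):
        for i, (tag, text) in enumerate(clause):
            if tag == "V" and text == verb_form:
                for j in (i - 1, i + 1):
                    if 0 <= j < len(clause) and clause[j][0] == "PP":
                        result.append(clause[j][1])
    return result


if __name__ == "__main__":
    # Hand-made chunking of the example sentence (illustrative only).
    sentence = [
        ("PP", "Nach Angaben der anwesenden Kliniker"),
        ("V", "soll"), ("PRO", "er"),
        ("PP", "mit einem PKW"), ("PP", "von der Fahrbahn"),
        ("V", "abgekommen"), ("V", "sein"),
        ("K-KONJ", "und"),
        ("ADV", "dort"), ("PP", "mit feststehenden Gegenständen"),
        ("V", "kollidiert"), ("V", "sein"),
    ]
    print(pps_in_scope(sentence, "kollidiert"))
```

With this restriction the PP 'mit einem PKW', which belongs to the first clause, is no longer counted as a complement of kollidieren.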

For the verb collide, the following prepositions were
encountered as results: 'mit' 27 times, 'nach' 5 times, the
prepositions 'beim', 'am', and 'als' twice each, and the
preposition 'in' once.

The filler of the 'Pp' with the preposition 'mit' can be 
assigned to the semantic category 'solid object'. 

• Pp: mit einem Pkw, mit einem Baum, mit dem Mer- 
cedes, mit der Mittelleitplanke, mit einem Verkehrs- 
schild 

⁵The treatment is different if the GermaNet patterns contain
explicit elements about local or temporal information, like BT or
BL.

In sum, the approach results in the following extended
verb frame of the verb collide:

• The filler of the 'NN' element can be either a person
(e.g., NE, pronoun) or a vehicle (e.g., a 'regular' noun,
like PKW).

• The 'Pp' element describes an object in the syntactic
form of a PP with the preposition 'mit' and the case
dative. In this case, the semantic category of the filler
is the category solid object.
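
One possible way to record such an extended frame is a small data structure. The Python sketch below is an assumption about how the result could be stored; the class and field names are hypothetical and do not describe the actual XDOC case frame format. Thematic roles are left out because the procedure described here does not derive them.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FrameElement:
    """One complement of a verb with the information added by the analysis."""
    code: str                        # GermaNet element code, e.g. 'NN'
    syntactic_form: str              # 'NP' or 'PP'
    filler_categories: List[str]     # semantic categories seen in the corpus
    preposition: Optional[str] = None
    case: Optional[str] = None
    optional: bool = False


@dataclass
class ExtendedVerbFrame:
    """A GermaNet verb frame enriched with corpus-derived details."""
    verb: str
    elements: List[FrameElement] = field(default_factory=list)


# The extended frame for kollidieren as summarised in the text.
KOLLIDIEREN = ExtendedVerbFrame(
    verb="kollidieren",
    elements=[
        FrameElement("NN", "NP", ["person", "vehicle"], case="nominative"),
        FrameElement("Pp", "PP", ["solid object"],
                     preposition="mit", case="dative", optional=True),
    ],
)

if __name__ == "__main__":
    for element in KOLLIDIEREN.elements:
        print(element)
```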

For the verb befahren there exist three verb frames in
GermaNet, each containing the elements NN and AN. For
these elements, the approach described above delivers the
following details:

• NN: The results here again contain instances of traffic
participants. The first subcategory describes persons,
in the form of NEs (registration number or name of the
person), pronouns, or the noun 'driver' in phrases
like 'driver of the car'. The second subcategory is
represented by vehicles, either as NEs (PKW VW Lupo) or
as nouns, like car, tramway, motor vessel, etc.

• AN: All possible candidates (e.g., street, German freeway,
avenue, canal, etc.) that are found in GermaNet
could be assigned to traffic route.

The additional elements AZ⁶ and BM were not analysed,
because up to now this work has been restricted to the
enrichment of elements like noun and prepositional phrases.
The extension to other possible elements in a verb frame is
part of our future work.

4. Discussion 

The results obtained via this approach can support a
designer of verb frames. Based on the verb frame information
in GermaNet, this method delivers possible candidates
of fillers for noun and prepositional phrases. The results
contain information about the semantic category and
syntactic form (and elements like prepositions) of case role
fillers.

The results depend on good lexical coverage, which is
needed to obtain the correct semantic information for a
filler, and depend strongly on the correct annotation of
syntactic structures. The number of coverings delivered for
a sentence by the parser can be reduced. First, only
coverings that are in accordance with GermaNet verb frames
are allowed. Second, only the relevant parts (clauses) of a
complex sentence are extracted.

One problem that occurred was the correct handling of
NEs in the corpus. In addition to date or time information,
the approach must cover names of locations (e.g., streets,
like A 9), names of vehicles (e.g., Opel Frontera), names of
persons (e.g., Mr Miller), and registration numbers (e.g.,
of persons: G 1345/78; or licence plate numbers: ABZAB-789).

5. Conclusion 

In this paper, we described an approach for the en- 
richment of GermaNet's verb frames. It is based on co- 
occurrence data analyses of a corpus of forensic autopsy 
protocols. 

⁶zu-infinitive complement



For the approach described above, the document parts
background and discussion of the forensic autopsy protocols
were used. These parts were chosen because only a small
proportion of domain specific terms occurs in them. The
form and the content of the background section are similar
to a report in a newspaper.

The approach outlined here is based on structural and
syntactic analysis and on the analysis of co-occurrence
data. These co-occurrence data were verbs and syntactic
structures in the neighborhood of the verbs. The quality
of the results is strongly dependent on the results of the
syntactic parser and on the correct handling of named
entities. Both can be enhanced by improving the domain
specific resources, like the grammar of the chart parser. Our
future work will be to confirm and evaluate the approach
with another corpus, for example the EUROPARL corpus,
available at http://www.isi.edu/~koehrii